TL;DR: GPT-2's head 1.5 directs attention to semantically similar tokens and actively suppresses self-attention. The mechanism is driven by a symmetric bilinear form whose negative eigenvalues enable the suppression. The head computes attention purely from token identity, independent of position. We cluster tokens semantically, interpret the weights to explain the attention scores, and steer self-suppression by tuning the eigenvalues.
A lower score means a better match. As expected, a survey of all heads in the model shows that head 1.5 is an outlier with a uniquely low score, confirming that it is specialized for this task.
Acknowledgment: I did this work during my time at ARENA.